My Introduction:

Name: Sunny Kumar Vaishnov

Student number: S3822295

Group number: 165 (Individual Only)

Phase 1 report of MATH 2319 (Machine Learning)

Course Coordinator & Course Lecturer: Dr. Vural Aksakalli

INTRODUCTION

I have the following dataset for the MATH 2319 (Machine Learning) Phase 1 Assignment. The objective of this House Rent Prediction dataset is to predict the monthly rent prices of available homes based on various explanatory variables describing aspects of residential houses. It is sourced from Kaggle.com. The dataset comprises 265190 house records with 22 columns.

The report showcases the following results:

  1. Overview of the dataset.
  2. Data wrangling, which covers importing the dataset, sampling it as it contains too many rows, finding and removing missing values, and removing irrelevant columns.
  3. Data description, which contains a summary of the categorical as well as the numerical variables.
  4. Data visualization using one, two and three variables.
  5. Summary of the report.
  6. Referencing.

Overview of Dataset

Data Source

The dataset is named House Rent Prediction and it comprises 265190 house records with 22 columns. It is sourced from kaggle.com [https://www.kaggle.com/rkb0023/houserentpredictiondataset] (house-rent-prediction-dataset. (2021). Retrieved 11 April 2021, from https://www.kaggle.com/rkb0023/houserentpredictiondataset).

Project aim

The main aim of this project/report is to import the data, clean it so that it can be used in machine learning algorithms to predict the monthly rent, highlight the descriptive features of the data and show its details through data visualizations.

Target Feature

Our main aim is to create a model that will predict the monthly rent by using the other variables. Our target feature is therefore the price variable, which is later renamed as rent price.

Descriptive Features of the Dataset

This dataset contains price information for the different types of houses listed on various websites in various areas of the United States.
The dataset has 265190 house records with 22 columns. The columns are listed below:

id: listing id
url: listing URL
region: craigslist region
region_url: region URL
price: rent per month (Target Column)
type: housing type
sqfeet: total square footage
beds: number of beds
baths: number of bathrooms
cats_allowed: cats allowed boolean (1 = yes, 0 = no)
dogs_allowed: dogs allowed boolean (1 = yes, 0 = no)
smoking_allowed: smoking allowed boolean (1 = yes, 0 = no)
wheelchair_access: has wheelchair access boolean (1 = yes, 0 = no)
electric_vehicle_charge: has electric vehicle charger boolean (1 = yes, 0 = no)
comes_furnished: comes with furniture boolean (1 = yes, 0 = no)
laundry_options: laundry options available (categorical, e.g. w/d in unit, laundry on site)
parking_options: parking options available (categorical, e.g. off-street parking, carport)
image_url: image URL
description: description by poster
lat: latitude
long: longitude
state: state of listing

All the descriptive features listed above are self-explanatory.

Data Wrangling

Preliminaries

To begin data pre-processing, we have to import certain modules in Python. The warnings module is imported to suppress warnings during pre-processing.
Next, we import pandas and numpy, which are very helpful in cleaning the dataset. matplotlib.pyplot, seaborn, altair and plotly are used to create the plots, and their magic functions are also enabled.

Here we import the CSV file of the dataset, which is known as Housing.train, and use head() to display the first few observations. The dataset is named df.

df.shape is used to see the shape of the dataset. Here, we can observe that the dataset contains 265190 rows and 22 columns.
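As a minimal sketch of this import step, the example below reads a tiny made-up CSV from memory instead of the real file, so the values and the reduced set of columns are assumptions for illustration only. Note that shape is an attribute of a pandas DataFrame and is read without parentheses.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for the real file (illustrative values only).
csv_text = """id,price,type,sqfeet,beds,baths
1,950,apartment,800,1,1.0
2,1400,house,1500,3,2.0
3,700,apartment,450,0,1.0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first few observations
print(df.shape)    # (rows, columns); an attribute, not a method call
```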

As the dataset contains more than 5000 rows, we have to take a subset of it. Here, we take a random sample of 5000 rows by using the sample() function. The new subset is named newdf.

We use the print function to display the number of rows and columns of the subset. We can see that there are exactly 5000 rows and 22 columns in the new subset created by sampling.
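The sampling step can be sketched as follows. The frame here is made up and much smaller than the real dataset, and the random_state seed is an assumption (the report does not say whether a seed was fixed); fixing it simply makes the sample reproducible.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the full dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.integers(500, 3000, size=100),
    "sqfeet": rng.integers(300, 2500, size=100),
})

# Draw a random subset without replacement; random_state is an assumed seed.
newdf = df.sample(n=10, random_state=42)

print(newdf.shape)  # number of rows and columns in the subset
```

With the real data, n would be 5000.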

dtypes is used to see the data types of the subset. Here we can see that there are 9 categorical variables, 10 integer variables and 3 float variables.
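One possible way to count the variables of each kind is with select_dtypes, sketched here on a made-up four-column frame (the column values are assumptions, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up subset with one text, two integer and one float column.
newdf = pd.DataFrame({
    "type": ["apartment", "house"],
    "beds": [1, 3],
    "cats_allowed": [1, 0],
    "baths": [1.0, 2.5],
})

print(newdf.dtypes)  # data type of every column

# Count columns by kind of data type.
n_int = newdf.select_dtypes(include=np.integer).shape[1]
n_float = newdf.select_dtypes(include=np.floating).shape[1]
n_categorical = newdf.select_dtypes(exclude="number").shape[1]
print(n_int, n_float, n_categorical)
```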

Here we display the missing values in the dataset. We can see that there are 1022 missing values in laundry_options, 1819 in parking_options, and 26 each in the lat and long variables.

parking_options contains 36% missing values and laundry_options contains 20%, while the lat and long variables each contain 0.52% missing values.
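A sketch of how the missing counts and percentages can be computed is given here. The small frame is made up for illustration; the last lines apply the same percentage formula to the counts reported in this section for the 5000 sampled rows.

```python
import numpy as np
import pandas as pd

# Made-up subset with some missing entries.
newdf = pd.DataFrame({
    "price": [900, 1200, 750, 1600, 1100],
    "laundry_options": ["w/d in unit", None, "laundry on site", None, "w/d in unit"],
    "parking_options": [None, "carport", None, None, "off-street parking"],
    "lat": [40.1, np.nan, 35.2, 41.0, 33.9],
})

missing_count = newdf.isnull().sum()                       # missing values per column
missing_pct = (missing_count / len(newdf) * 100).round(2)  # share of rows, in percent
print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))

# Same formula applied to the counts reported in the text (5000 rows).
reported = pd.Series({"laundry_options": 1022, "parking_options": 1819, "lat": 26, "long": 26})
reported_pct = (reported / 5000 * 100).round(2)
print(reported_pct)
```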

To proceed with data visualization, and later with building the machine learning model, we have to deal with these missing values, otherwise they will cause problems. There are various ways to deal with them, such as imputing with the mean, median or mode, or simply removing them from the dataset. Here, we use the dropna() function to remove the rows containing missing values and assign the result to a new variable, new_df.

Above, we can see that new_df contains 3065 rows and 22 columns, whereas earlier there were 5000 rows and 22 columns. This shows that all the rows containing missing values have been removed from our dataset.

Just to be sure, we use isnull().sum() to check whether any missing values are still present in our dataset.

As we can see, there are no missing values remaining in our dataset.
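The removal and the follow-up check can be sketched like this, again on a small made-up frame rather than the real subset.

```python
import numpy as np
import pandas as pd

# Made-up subset: rows 1, 2 and 3 each have one missing value.
newdf = pd.DataFrame({
    "price": [900, 1200, 750, 1600, 1100],
    "laundry_options": ["w/d in unit", None, "laundry on site", "w/d in unit", "laundry on site"],
    "parking_options": ["carport", "carport", None, "off-street parking", "carport"],
    "lat": [40.1, 38.2, 35.2, np.nan, 33.9],
})

# Drop every row that contains at least one missing value.
new_df = newdf.dropna()

print(new_df.shape)           # fewer rows, same number of columns
print(new_df.isnull().sum())  # every count should now be zero
```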

Now, to proceed further, we remove the irrelevant columns from our dataset by using the drop() function and name the new variable Freshdata.

We have removed the columns id, url, region, region_url, image_url, description, state, lat and long, as they were of no use to us.

We use the columns attribute to display the columns available in our new dataset.

As we can see, all the irrelevant columns have been removed from our dataset.
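A minimal sketch of this column removal is shown here. The single row of values is made up, and only three of the retained columns are included to keep the example short.

```python
import pandas as pd

# One made-up row with the nine irrelevant columns and three retained ones.
columns = ["id", "url", "region", "region_url", "price", "type", "sqfeet",
           "image_url", "description", "lat", "long", "state"]
new_df = pd.DataFrame(
    [[1, "u", "r", "ru", 950, "apartment", 800, "img", "text", 40.1, -75.2, "pa"]],
    columns=columns,
)

irrelevant = ["id", "url", "region", "region_url", "image_url",
              "description", "state", "lat", "long"]
Freshdata = new_df.drop(columns=irrelevant)

print(Freshdata.columns)  # columns is an attribute, read without parentheses
```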

Now, we change the names of the columns for our convenience and name the dataset Finaldf for further use.

Below we use the columns attribute again to check the new column names.
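The renaming can be sketched with rename(). The new names in this mapping are hypothetical: the report only states that price becomes the rent price, so rent_price and furnished are assumed spellings chosen for illustration.

```python
import pandas as pd

# Made-up frame with a few of the retained columns.
Freshdata = pd.DataFrame({
    "price": [950, 1400],
    "sqfeet": [800, 1500],
    "comes_furnished": [0, 1],
})

# Hypothetical mapping from old to new column names.
new_names = {"price": "rent_price", "comes_furnished": "furnished"}
Finaldf = Freshdata.rename(columns=new_names)

print(Finaldf.columns)  # check the new column names
```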

Data Description

To check the first 5 rows of our cleaned dataset, we use the head() function.

To check the last 5 rows of our cleaned dataset, we use the tail() function.

To check the information about our Finaldf, we use the info() function.

Above we can see that we have 1 float variable, 9 integer variables and 3 object variables in our dataset.

Below we display the summary of our continuous features.

Here we display the summary of our categorical features.

Here we can see the summary of our float features.

Below we highlight the unique values that are present in our categorical variables.
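These summaries can be sketched with describe() and unique(). The frame and the column name rent_price are assumptions for illustration; the exact calls used in the original notebook are not shown in this report.

```python
import pandas as pd

# Made-up cleaned dataset with numeric and categorical columns.
Finaldf = pd.DataFrame({
    "rent_price": [800, 1000, 1200, 1000],
    "baths": [1.0, 1.5, 2.0, 1.5],
    "type": ["apartment", "house", "apartment", "apartment"],
})

# Summary of numeric features: count, mean, std, min, quartiles, max.
numeric_summary = Finaldf.describe()

# Summary of categorical features: count, unique, top, freq.
categorical_summary = Finaldf.select_dtypes(exclude="number").describe()

# Unique values of one categorical variable.
unique_types = Finaldf["type"].unique()

print(numeric_summary)
print(categorical_summary)
print(unique_types)
```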

Here, we use the groupby function to see the kinds of parking options available for each type of house. For example, for apartments, 1317 have off-street parking, 470 have a carport, and so on.

Here, we use the groupby function to see the kinds of laundry options available for each type of house. For example, for apartments, 1068 have w/d in unit, 512 have laundry on site, and so on.
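A sketch of such a grouped count, on made-up rows, could look like this. Combining groupby with value_counts is one possible approach and not necessarily the one used in the original notebook.

```python
import pandas as pd

# Made-up listings; the category labels follow those mentioned in the text.
Finaldf = pd.DataFrame({
    "type": ["apartment", "apartment", "apartment", "house", "house"],
    "parking_options": ["off-street parking", "carport", "off-street parking",
                        "carport", "carport"],
})

# Number of listings for each parking option within each housing type.
parking_by_type = Finaldf.groupby("type")["parking_options"].value_counts()
print(parking_by_type)
```

The same pattern applies to laundry_options by swapping the column name.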

Below we list all the boolean variables available in our dataset. Boolean variables are those whose values are "yes" or "no", but in the dataset they are recorded as "1" and "0". Here the boolean variables are cats allowed, dogs allowed, smoking allowed, wheelchair access, electric vehicle charge and furnished.

Data Visualizations

Now, we show data visualizations using one variable, two variables and three variables, plotting two graphs of each type.

Data Visualizations using one Variable

Below is a box plot (Fig 1) showing the average rent price. We have used plotly to create the box plot.

According to the box plot, the average rent is $1102. This is not the exact average rent, as the box plot clearly shows various outliers (high rents). We could remove the outliers from the dataset, but we will not, as they are genuine rent prices that can be high, and removing them could alter our dataset and our later predictions.
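To illustrate why outliers matter here, the sketch below computes the quantities a standard box plot draws. The centre line of a standard box plot is the median, and a few very high rents pull the mean above it. The rent values are made up, and the 1.5 × IQR rule for flagging outliers is an assumption about how the figure marks them.

```python
import pandas as pd

# Made-up rents with one very high value.
rent = pd.Series([700, 850, 900, 1000, 1100, 1200, 1300, 1500, 5000], name="rent_price")

q1, median, q3 = rent.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr           # points above this are drawn as outliers
outliers = rent[rent > upper_fence]

print("median:", median)
print("mean:", round(rent.mean(), 2))  # pulled upward by the high rent
print("outliers:", list(outliers))
```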

Fig 2 shows the number of places available for rent for each parking option.

Data Visualizations using two Variables

Fig 3 is a scatter plot which highlights the relationship between rent price and the size of the place.

Although it is not clearly visible, we can say that there is a positive linear relationship between rent price and the sqfeet of the place. A place of 1189 sqfeet has a rent of USD 489, while a place of 3220 sqfeet has a rent of USD 5000. In other words, a place with a larger sqfeet (area) has a higher price than a place with a smaller sqfeet.
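When a scatter plot is hard to read, the direction of a linear relationship can be checked numerically with the Pearson correlation coefficient. This is a sketch only: the frame mixes the two points quoted in the text with made-up ones, and the column name rent_price is an assumption, so the resulting value says nothing about the real dataset.

```python
import pandas as pd

# Two points from the text (1189 -> 489, 3220 -> 5000) plus made-up ones.
Finaldf = pd.DataFrame({
    "sqfeet": [450, 700, 900, 1189, 1500, 3220],
    "rent_price": [600, 800, 950, 489, 1400, 5000],
})

# Pearson correlation: values above 0 indicate a positive linear relationship.
r = Finaldf["sqfeet"].corr(Finaldf["rent_price"])
print(round(r, 3))
```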

In Fig 4, a box plot displays the relationship between the number of bedrooms in an accommodation and its price by showing the average prices.

In Fig 4, we can see that a place with more bedrooms has a higher price. In other words, as the number of bedrooms in a place increases, its price also increases. A place with 0 bedrooms has an average rent of 4831, while a place with 1 bedroom has a rent of $970, and so on.
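The per-bedroom figures behind such a plot can be sketched with a grouped aggregation. The rows and the column name rent_price are made up for illustration and do not reproduce the values quoted in the text.

```python
import pandas as pd

# Made-up listings with 0, 1 and 2 bedrooms.
Finaldf = pd.DataFrame({
    "beds": [0, 0, 1, 1, 2, 2],
    "rent_price": [700, 800, 950, 1050, 1300, 1500],
})

# Mean and median rent for each number of bedrooms.
rent_by_beds = Finaldf.groupby("beds")["rent_price"].agg(["mean", "median"])
print(rent_by_beds)
```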

Data Visualizations using three Variables

Fig 5 contains a scatter plot which shows the price of different houses as per their parking options.

In Fig 6, we can see a bar chart which exhibits the prices of houses as per their laundry options and number of bedrooms.

We can see that as the number of bedrooms increases, the price of the place also increases, for every laundry option present.
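The numbers behind a grouped bar chart of this kind can be sketched with a pivot table of mean rent by laundry option and number of bedrooms. The rows and the column name rent_price are assumptions for illustration.

```python
import pandas as pd

# Made-up listings covering two laundry options and two bedroom counts.
Finaldf = pd.DataFrame({
    "laundry_options": ["w/d in unit", "w/d in unit", "laundry on site",
                        "laundry on site", "w/d in unit", "laundry on site"],
    "beds": [1, 2, 1, 2, 1, 2],
    "rent_price": [1000, 1400, 900, 1200, 1100, 1300],
})

# Rows: laundry option, columns: number of bedrooms, cells: mean rent.
table = Finaldf.pivot_table(index="laundry_options", columns="beds",
                            values="rent_price", aggfunc="mean")
print(table)
```

Reading across a row shows how rent changes with the number of bedrooms for one laundry option.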

Summary

The main aim of this report is to import the dataset, describe the variables available in it, clean the dataset for future use and show descriptive visualizations of the variables. First, we provide a descriptive summary of the features available in the dataset and then begin data pre-processing by importing the dataset. After importing, we check the shape of the dataset, which shows that it has 265190 rows and 22 columns. After that, I take a subset of the dataset containing only 5000 rows and check the data types of the variables. I then remove the missing values present in the dataset, as they can cause issues later. Once the missing values are removed, there are 3065 rows and 22 columns in the dataset. After that, I rename the columns and name the dataset Finaldf. Then we look at the summary of all variables of the dataset and the unique values present in the categorical columns.

Then we begin the visualizations. First, we show a box plot of the rent price, which says that the average rent is $1102; this is not the exact average rent, as the box plot clearly shows various outliers (high rents). Fig 2 shows the number of places available for rent for each parking option. Next come the visualizations using two variables: first a scatter plot showing a positive linear relationship between rent price and the size of the place, then a box plot which shows that as the number of beds increases, the price of the accommodation also increases. Then we proceed to the visualizations using three variables. Fig 5 is a scatter plot which shows the price of different houses as per their parking options, and Fig 6 is a bar chart which shows that the price of each unit increases with the number of bedrooms across the different laundry options.

That is all we have done in this report for Phase 1. After that, we will proceed to the next phase, which is the prediction of rent.

Referencing

  1. The dataset is imported from the House Rent Prediction dataset:
    house-rent-prediction-dataset. (2021). Retrieved 11 April 2021, from https://www.kaggle.com/rkb0023/houserentpredictiondataset
  2. Feature Selection and Ranking in Machine Learning | www.featureranking.com. (2021). Retrieved 12 April 2021, from https://www.featureranking.com/